January 10, 2000
Stump the Geeks
Section: Connect
Edition: Final
Page: D6
Estimated Printed Pages: 2
Index Terms:
hi-tech
Letter
Article Type:Letter
Article Text:
Q. I'm putting historical material online for the public and have a problem: I want to put the pages online as page-images for accuracy, but also include hidden OCR'd text from the pages just for searching purposes. (Because HTML can't display OCR'd text EXACTLY like it is on the original pages and proofing kills you if there are thousands of pages.)
I thought of putting the OCR'd text within Comment Tags, "<!-- -->" but AltaVista says they don't index the contents of Comments.
Any suggestions?
Dave Maxey
Apex
A. You may be surprised to know that you have encountered not only a limitation of AltaVista, but a limitation of the Web.
Web search sites like AltaVista are in a way a compromise between solving two distinct problems, often known as "resource discovery" and "deep searching." The problem of finding your historical collection and related collections is an example of resource discovery.
Searching within your collection in a comprehensive way is an example of deep searching.
AltaVista does both at the same time, but neither extremely well; it views all data as a massive set of Web pages, and that is one reason that search results often contain a jumble of what sometimes appear to be links to sites, databases, specific documents, etc. (The development of XML will improve the situation, but it does not solve the underlying problem.)
For many people who use Web searching as a starting point for browsing, this approach is adequate. However, if you are trying to build your own document collection, you may find that you need features of deep searching beyond merely indexing Web pages.
The solution to this is to run your own search engine, if you are somewhat technically inclined. Otherwise you may have to choose an unsatisfying way to work around AltaVista, for example by making your text invisible in the HTML.
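To make the first option concrete, here is a minimal sketch in Python of the core of such a search engine: an index that maps each word in the OCR'd text to the page-images it came from. The file names, the sample text, and the function names are all hypothetical, and a real search engine does a good deal more (ranking, phrase searching, storing the index on disk).

```python
import re
from collections import defaultdict

# Hypothetical data: page-image file name -> OCR'd text for that page.
OCR_PAGES = {
    "page-001.gif": "Minutes of the town council, January 1900",
    "page-002.gif": "The council approved funds for the new bridge",
    "page-003.gif": "Bridge construction began in March",
}


def tokenize(text):
    """Split text into lowercase words, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of page-images whose OCR'd text contains it."""
    index = defaultdict(set)
    for image_name, text in pages.items():
        for word in tokenize(text):
            index[word].add(image_name)
    return index


def search(index, query):
    """Return the page-images that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    matches = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(matches)


if __name__ == "__main__":
    index = build_index(OCR_PAGES)
    for image_name in search(index, "council bridge"):
        print(image_name)
```

The visitor sees only the page-images; the OCR'd text is used solely to build the index, so it never has to be proofed or displayed.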
Another advantage of setting up your own search engine is that it will allow you to experiment with other formats such as PDF. PDF can represent the original documents exactly as scanned, with the OCR'd text included inside the document. Several search engines are able to index PDF files directly. This means that you would have to maintain only a single version of each document, and your users would get both the correct image and the text in a single document.
None of this answers the question: What is the best way for your collection to be searched with AltaVista? This is where we return to the limits of the Web. Are your data a "collection" or an indistinct area of "the Web"? Are they merely a part of the HTMLting pot? (Thank Paul Jones of Stump the Geeks for the pun.) Are they a resource worth being discovered?
Nassib Nassar
president of Etymon Systems Inc.
www.etymon.com
Copyright 2000 by The News & Observer Pub. Co.
Record Number: fo4qrk89