Nicholas Carroll
Date: January 27, 2001
Modified: N/A
Doug Engelbart's Open Hyperdocument System proposes Dynamic Knowledge
Repositories (DKRs). Among other features, DKRs will need powerful
information retrieval capabilities. (Since someone will abbreviate this
process by the end of the week, we might as well do so right now, and call
it DKR/IR.)
The explosion of the World Wide Web has – one hopes – taught most
thoughtful observers something that was known to information science
people decades ago: information retrieval does not scale well. Increasing
a search base by an order of magnitude can create a whole new game. In
fact, decreasing a search base by an order of magnitude can create a whole
new game.
Over the last few years of ecommerce web site architecture work, I have
spent several thousand hours studying electronic search from various directions
– how search software works, how users think, and how to structure search
logic and data structures to meet users in the middle. While my
mathematics are occasionally naive, these thoughts come from fairly hard
lessons in what works – and what does not.
Why am I writing this? There will be several small papers
flowing off this one. Frankly, I don't expect many OHS developers to read
them all. The point of the exercise is to get IR requirements on the radar
screen, for those who are doing deep thinking about lower-level
architecture. When finished, I'll write a synopsis.
(Note: "information retrieval," as IR people use it, means finding
information. Actually retrieving the information onto your computer screen
is called "information access." An unfortunate choice of terms, but there
we are.)
The Basis of WWW Information Retrieval
Present Search Tools
Current Incarnations On the Web
The Smaller DKR
The Layers That Affect Information Retrieval
The Basis of WWW Information Retrieval
Very recently, in a galaxy which now seems far away, high-tech startup
companies were going to make the entire Internet searchable with "95%
relevance." Or was it 98%? In either case, the standard claim was "We're
already at 70%!"
Then they hit what a leading information science researcher once called
"the slippery slope to artificial intelligence." Algorithms (the word has
lost most meaning in the last few years) lost much of their sheen. Natural
Language Processing (NLP) is likewise tarnished, and most NLP operations
now have a few dozen workers in the back room, tweaking searches that the
computers could not understand. Likewise for the extremely lowbrow AI of
autoresponders – the more astute organizations have discovered that the
proper response to "Stop sending catalogs!" is not "Thank you! Your
catalog is on the way!"
Present Search Tools
Search tools existed long before the Internet spawned the
WWW. Of the major types – hierarchies, keywords, and databases – it is the
first two that have appeared most often on the Web. And since the Web is
familiar to most readers, I'll start there:
Hierarchies (directories) Yahoo! was the original, or at
least the first popularized hierarchical directory, indexing the Web by a
process of manual submissions and editorial review. For a long time they
employed 40 editors to index the Web. Looksmart came along shortly
thereafter, and employed over 100 editors. Both approaches utterly failed
to keep pace with the growth of the web. In 1997, Yahoo! had fallen to
cataloging only 20% of new submissions. Today they have been left in the
dust by the Open Directory Project
(ODP), with over 30,000 volunteer editors – which in turn cannot keep pace
with the expansion of the web.
Keywords (search engines) I suspect search engines were
conceived by some fairly naive programmers (mathematicians would have
known better?). Algorithms would beautifully index the entire Web, they
thought. They thought wrong. Pornographers quickly proved the algorithms
were hugely subject to manipulation. Today those algorithms are buttressed
by spam lists, external weighting schemes, and manual intervention. The
clean math has become heavily filtered, the data well-massaged.
Database searches have not made much of an entry onto the Web. I
imagine this is because a) it requires at least a little user
sophistication to write a query, no matter how good the interface is, and
b) web sites are typically built with a mixture of design, Perl, and Java
skills; when database knowledge is involved, it is usually buried at the
back end. (Note: many search engines use tabling methods to speed search,
but there is no database mindset at the user level.)
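The contrast among these three tool types can be sketched in a few lines of code. Everything below is hypothetical toy data, not any real engine's implementation; the point is only the difference in what the user must know: a hierarchy is browsed, keywords are matched against an automatically built inverted index, and a database demands a structured query.

```python
# Toy illustration of the three search types: hierarchy, keywords, database.
# All names and data are hypothetical.
from collections import defaultdict

docs = {
    1: {"title": "Intro to Hypertext", "year": 1998, "topic": "hypertext"},
    2: {"title": "Search Engine Spam", "year": 2000, "topic": "search"},
    3: {"title": "Hypertext Retrieval", "year": 2000, "topic": "hypertext"},
}

# 1. Hierarchy (directory): a human-curated tree the user browses.
directory = {"computing": {"hypertext": [1, 3], "search": [2]}}

# 2. Keywords (search engine): an inverted index built automatically.
index = defaultdict(set)
for doc_id, doc in docs.items():
    for word in doc["title"].lower().split():
        index[word].add(doc_id)

# 3. Database: a structured query the user must know how to phrase.
def query(topic, min_year):
    return [d for d, doc in docs.items()
            if doc["topic"] == topic and doc["year"] >= min_year]

print(directory["computing"]["hypertext"])  # browsing the tree -> [1, 3]
print(sorted(index["hypertext"]))           # keyword lookup    -> [1, 3]
print(query("hypertext", 2000))             # structured query  -> [3]
```

Note how only the database query lets the user constrain on fields other than words in the title, and only at the cost of knowing the schema.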
Current Incarnations On the Web
Today we can see several variants:
Companies that kept hammering at the mathematics, Google.com probably being the most
successful.
Those who turned their backs on the mathematics, and have been
turning to the manipulation of the data structures, indexing, and
cataloging, variously calling it "content synthesis," "knowledge
integration," and other terms. These are typically startups subcontracting
to major organizations. AskJeeves.com is the most visible of
the consumer-oriented ones.
A third group moved towards established data schemes, adopting
classification systems such as Library of Congress cataloging. I have been
through this, and it rarely works. Catalogers become rigid over time, and
gradually fall out of touch with users. Amazon.com relies heavily on
catalogers, and their search suffers heavily as a result.
A fourth and much smaller group has synthesized mathematics with
their own sorting and cataloging methods. NorthernLight.com is a leading
example.
A fifth group of more complex mutations, which I won't be dealing with
right now, includes search algorithms, editors, user voting, and subject
clumping schemes such as "themes" or topic maps. There are dozens
(hundreds?) of these. Alexa.com and Oingo.com are two of the more
prominent. Note: most directories now have a search engine tacked on as
an additional resource, and likewise most search engines index the ODP for
its more precise cataloging.
The Smaller DKR
In structuring DKR information retrieval, I favor a variant of the
fourth approach: synthesizing good algorithms with good data structures –
and then allowing some user access to both.
Some opinions about the path to quality DKR/IR:
- In expanding knowledge systems, pre-conceived structures are
inevitably ill-conceived structures. This is guaranteed by the facts
that a) the knowledge structure will meet users with unanticipated
frames of reference, and b) AI is advancing. Either fact can render
today's approach obsolete. This is true for all types of data
storage/retrieval schemes.
- It would be nice to have closure for keyword, relational, and
hierarchical searches.
- Hierarchies, to my mind, are not entirely dead. I'm of the opinion
that new forms of hierarchical search will make a return in the
not-too-distant future, displayed as topographies.
- Database designers, on the whole, are constitutionally not equipped
to design this sort of extensible data structure, as they are in the
habit of locking users out of data areas.
- OHS information retrieval needs to be two-footed for the immediate
future – the best IR systems will be an amalgam of good algorithms and
good data structures. I call the latter "intelligent databases" – in the
dual sense that they are designed intelligently, and the data can be
augmented (tweaked or enhanced) by human intelligence.
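As a minimal sketch of what such an "intelligent database" might look like, the following combines a naive term-frequency score (the "algorithm") with a human augmentation layer: an editor-maintained boost table and a manual spam list. The scoring scheme, names, and data are all hypothetical assumptions, not a description of any existing system.

```python
# Sketch of an "intelligent database" in the dual sense above:
# an algorithmic score that human intelligence can augment.
# Scoring scheme, names, and data are hypothetical.
from collections import Counter

docs = {
    "a": "open hyperdocument system retrieval",
    "b": "retrieval retrieval retrieval spam spam",
    "c": "dynamic knowledge repository retrieval",
}

# Human augmentation layer: editors tweak or enhance the raw data.
editor_boost = {"c": 2.0}    # flagged by an editor as high quality
spam_list = {"b"}            # manual intervention against manipulation

def search(term):
    results = []
    for doc_id, text in docs.items():
        if doc_id in spam_list:
            continue                          # filter before ranking
        score = Counter(text.split())[term]   # naive term-frequency "algorithm"
        score *= editor_boost.get(doc_id, 1.0)
        if score > 0:
            results.append((doc_id, score))
    return sorted(results, key=lambda r: -r[1])

print(search("retrieval"))  # -> [('c', 2.0), ('a', 1.0)]
```

The clean math alone would rank the keyword-stuffed document "b" first; the spam list and editor boosts are exactly the kind of filtering and massaging described above.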
The Layers That Affect Information Retrieval
Moving towards the point, this is the first of a few short papers on
retrieving information from DKRs. This "layers" description is a top-down
user perspective; it does not precisely reflect software architecture.
I'm addressing these in no particular order. Finished papers are
linked.
- Viewing tool - browser, plugin, whatever interface
- Search configuration - user level
- Cataloging by object author/creator
- Search configuration - administrator level
- Search code
- Administrator configuration of data and metadata superstructures
- Design of object storage structures
- Objects - documents, video, audio, other
References

Toward High-Performance Organizations: A Strategic Role for Groupware,
Douglas C. Engelbart, June 1992.
http://www.bootstrap.org/augment-132811.htm

The Invisible Substrate of Information Science, Marcia J. Bates.
http://www.gseis.ucla.edu/faculty/bates/substrate.html
Bates is a thorough and thoughtful information science scholar. I do not
always agree, but I always pay attention.

Human, Database, and Domain Factors in Content Indexing and Access to
Digital Libraries and the Internet, Marcia J. Bates

The Memory Palace of Matteo Ricci, Jonathan Spence. Synopsis:
http://www.hastingsresearch.com/cgi-bin/ax.cgi?http://www.tolearn.net/marketing/palace.htm
Ricci published in 1596. He was an early founder of information storage
and retrieval.

Retrieval Structure Manipulations.
http://muskingum.edu/~cal/database/Encoding6.html

Name Spaces As Tools for Integrating the Operating System Rather Than As
Ends in Themselves, Hans Reiser.
http://www.namesys.com/whitepaper.html