Nicholas Carroll
Date: January 27, 2001
Modified: N/A
Doug Engelbart's Open Hyperdocument System proposes Dynamic Knowledge
Repositories (DKRs). Among other features, DKRs will need powerful
information retrieval capabilities. (Since someone will abbreviate this
process by the end of the week, we might as well do so right now, and call
it DKR/IR.)
The explosion of the World Wide Web has – one hopes – taught most
thoughtful observers something that was known to information science
people decades ago: information retrieval does not scale well. Increasing
a search base by an order of magnitude can create a whole new game. In
fact, decreasing a search base by an order of magnitude can create a whole
new game.
Over the last few years of ecommerce web site architecture work, I have
spent several thousand hours studying electronic search from various directions
– how search software works, how users think, and how to structure search
logic and data structures to meet users in the middle. While my
mathematics are occasionally naive, these thoughts come from fairly hard
lessons in what works – and what does not.
Why am I writing this? There will be several small papers
flowing off this one. Frankly, I don't expect many OHS developers to read
them all. The point of the exercise is to get IR requirements on the radar
screen, for those who are doing deep thinking about lower-level
architecture. When finished, I'll write a synopsis.
(Note: "information retrieval," as IR people use it, means finding
information. Actually retrieving the information onto your computer screen
is called "information access." An unfortunate choice of terms, but there
we are.)
The Basis of WWW Information Retrieval
Present Search Tools
Current Incarnations On the Web
The Smaller DKR
The Layers That Affect Information Retrieval
The Basis of WWW Information Retrieval
Very recently, in a galaxy which now seems far away, high-tech startup
companies were going to make the entire Internet searchable with "95%
relevance." Or was it 98%? In either case, the standard claim was "We're
already at 70%!"
Then they hit what a leading information science researcher once called
"the slippery slope to artificial intelligence." Algorithms (the word has
lost most meaning in the last few years) lost much of their sheen. Natural
Language Processing (NLP) is likewise tarnished, and most NLP operations
now have a few dozen workers in the back room, tweaking searches that the
computers could not understand. Likewise for the extremely lowbrow AI of
autoresponders – the more astute organizations have discovered that the
proper response to "Stop sending catalogs!" is not "Thank you! Your
catalog is on the way!"
Present Search Tools
Search tools existed long before the Internet spawned the
WWW. Of the major types – hierarchies, keywords, and databases – it is the
first two that have appeared most often on the Web. And since the Web is
familiar to most readers, I'll start there:
Hierarchies (directories) Yahoo! was the original, or at
least the first popularized hierarchical directory, indexing the Web by a
process of manual submissions and editorial review. For a long time they
employed 40 editors to index the Web. Looksmart came along shortly
thereafter, and employed over 100 editors. Both approaches utterly failed
to keep pace with the growth of the web. In 1997, Yahoo! had fallen to
cataloging only 20% of new submissions. Today they have been left in the
dust by the Open Directory Project
(ODP), with over 30,000 volunteer editors – which in turn cannot keep pace
with the expansion of the web.
Keywords (search engines) I suspect search engines were
conceived by some fairly naive programmers (mathematicians would have
known better?). Algorithms would beautifully index the entire Web, they
thought. They thought wrong. Pornographers quickly proved the algorithms
were hugely subject to manipulation. Today those algorithms are buttressed
by spam lists, external weighting schemes, and manual intervention. The
clean math has become heavily filtered, the data well-massaged.
Database searches have not made much of an entry onto the Web. I
imagine this is because a) it requires at least a little user
sophistication to write a query, no matter how good the interface is, and
b) web sites are typically built with a mixture of design, Perl, and Java
skills; when database knowledge is involved, it is usually buried at the
back end. (Note: many search engines use tabling methods to speed search,
but there is no database mindset at the user level.)
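The contrast among these three tool types can be sketched in a few lines of code. Everything below is hypothetical toy data, not any real engine's implementation; the point is only the difference in what the user must know: a hierarchy is browsed, keywords are matched against an automatically built inverted index, and a database demands a structured query.

```python
# Toy illustration of the three search types: hierarchy, keywords, database.
# All names and data are hypothetical.
from collections import defaultdict

docs = {
    1: {"title": "Intro to Hypertext", "year": 1998, "topic": "hypertext"},
    2: {"title": "Search Engine Spam", "year": 2000, "topic": "search"},
    3: {"title": "Hypertext Retrieval", "year": 2000, "topic": "hypertext"},
}

# 1. Hierarchy (directory): a human-curated tree the user browses.
directory = {"computing": {"hypertext": [1, 3], "search": [2]}}

# 2. Keywords (search engine): an inverted index built automatically.
index = defaultdict(set)
for doc_id, doc in docs.items():
    for word in doc["title"].lower().split():
        index[word].add(doc_id)

# 3. Database: a structured query the user must know how to phrase.
def query(topic, min_year):
    return [d for d, doc in docs.items()
            if doc["topic"] == topic and doc["year"] >= min_year]

print(directory["computing"]["hypertext"])  # browsing the tree -> [1, 3]
print(sorted(index["hypertext"]))           # keyword lookup    -> [1, 3]
print(query("hypertext", 2000))             # structured query  -> [3]
```

Note how only the database query lets the user constrain on fields other than words in the title, and only at the cost of knowing the schema.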
Current Incarnations On the Web
Today we can see several variants:
Companies that kept hammering at the mathematics, Google.com probably being the most
successful.
Those who turned their backs on the mathematics, and have been
turning to the manipulation of the data structures, indexing, and
cataloging, variously calling it "content synthesis," "knowledge
integration," and other terms. These are typically startups subcontracting
to major organizations. AskJeeves.com is the most visible of
the consumer-oriented ones.
A third group moved towards established data schemes, adopting
classification systems such as Library of Congress cataloging. I have been
through this, and it rarely works. Catalogers become rigid over time, and
gradually fall out of touch with users. Amazon.com relies heavily on
catalogers, and their search suffers heavily as a result.
A fourth and much smaller group has synthesized mathematics with
their own sorting and cataloging methods. NorthernLight.com is a leading
example.
A fifth group of more complex mutations, which I won't be dealing with
right now, includes search algorithms, editors, user voting, and subject
clumping schemes such as "themes" or topic maps. There are dozens
(hundreds?) of these. Alexa.com and Oingo.com are two of the more
prominent. Note: most directories now have a search engine tacked on as
an additional resource, and likewise most search engines index the ODP for
its more precise cataloging.
The Smaller DKR
In structuring DKR information retrieval, I favor a variant of the
fourth approach: synthesizing good algorithms with good data structures –
and then allowing some user access to both.
Some opinions about the path to quality DKR/IR:
- In expanding knowledge systems, pre-conceived structures are
inevitably ill-conceived structures. This is guaranteed by the facts
that a) the knowledge structure will meet users with unanticipated
frames of reference, and b) AI is advancing. Either fact can render
today's approach obsolete. This is true for all types of data
storage/retrieval schemes.
- It would be nice to have closure for keyword, relational, and
hierarchical searches.
- Hierarchies, to my mind, are not entirely dead. I'm of the opinion
that new forms of hierarchical search will make a return in the
not-too-distant future, displayed as topographies.
- Database designers, on the whole, are constitutionally not equipped
to design this sort of extensible data structure, as they are in the
habit of locking users out of data areas.
- OHS information retrieval needs to be two-footed for the immediate
future – the best IR systems will be an amalgam of good algorithms and
good data structures. I call the latter "intelligent databases" – in the
dual sense that they are designed intelligently, and the data can be
augmented (tweaked or enhanced) by human intelligence.
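As a minimal sketch of what such an "intelligent database" might look like, the following combines a naive term-frequency score (the "algorithm") with a human augmentation layer: an editor-maintained boost table and a manual spam list. The scoring scheme, names, and data are all hypothetical assumptions, not a description of any existing system.

```python
# Sketch of an "intelligent database" in the dual sense above:
# an algorithmic score that human intelligence can augment.
# Scoring scheme, names, and data are hypothetical.
from collections import Counter

docs = {
    "a": "open hyperdocument system retrieval",
    "b": "retrieval retrieval retrieval spam spam",
    "c": "dynamic knowledge repository retrieval",
}

# Human augmentation layer: editors tweak or enhance the raw data.
editor_boost = {"c": 2.0}    # flagged by an editor as high quality
spam_list = {"b"}            # manual intervention against manipulation

def search(term):
    results = []
    for doc_id, text in docs.items():
        if doc_id in spam_list:
            continue                          # filter before ranking
        score = Counter(text.split())[term]   # naive term-frequency "algorithm"
        score *= editor_boost.get(doc_id, 1.0)
        if score > 0:
            results.append((doc_id, score))
    return sorted(results, key=lambda r: -r[1])

print(search("retrieval"))  # -> [('c', 2.0), ('a', 1.0)]
```

The clean math alone would rank the keyword-stuffed document "b" first; the spam list and editor boosts are exactly the kind of filtering and massaging described above.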
The Layers That Affect Information Retrieval
Moving towards the point, this is the first of a few short papers on
retrieving information from DKRs. This "layers" description is a top-down
user perspective; it does not precisely reflect software architecture.
I'm addressing these in no particular order. Finished papers are
linked.
- Viewing tool - browser, plugin, whatever interface
- Search configuration - user level
- Cataloging by object author/creator
- Search configuration - administrator level
- Search code
- Administrator configuration of data and metadata superstructures
- Design of object storage structures
- Objects - documents, video, audio, other
References

Toward High-Performance Organizations: A Strategic Role for Groupware,
Douglas C. Engelbart, June 1992.
http://www.bootstrap.org/augment-132811.htm

The Invisible Substrate of Information Science, Marcia J. Bates.
http://www.gseis.ucla.edu/faculty/bates/substrate.html
Bates is a thorough and thoughtful information science scholar. I do not
always agree, but I always pay attention.

Human, Database, and Domain Factors in Content Indexing and Access to
Digital Libraries and the Internet, Marcia J. Bates

The Memory Palace of Matteo Ricci, Jonathan Spence. Synopsis:
http://www.hastingsresearch.com/cgi-bin/ax.cgi?http://www.tolearn.net/marketing/palace.htm
Ricci published in 1596. He was an early founder of information storage
and retrieval.

Retrieval Structure Manipulations.
http://muskingum.edu/~cal/database/Encoding6.html

Name Spaces As Tools for Integrating the Operating System Rather Than As
Ends in Themselves, Hans Reiser.
http://www.namesys.com/whitepaper.html