Everyone has undoubtedly seen announcements of the new, non-Boolean, natural language search techniques from West, DIALOG, and Mead Data Central. Some of you may already be experimenting with or using Westlaw's WIN, DIALOG's TARGET, or Mead's FREESTYLE. All are based on the assumption that our standard command-driven online systems coupled with Boolean logic searching are not only difficult to learn, but may sometimes miss relevant documents. This assumption is based not only on searchers' experiences, but on years of controlled tests in the information retrieval laboratories.
There is no doubt that this new way to search old systems is gaining a lot of attention. Westlaw Is Natural (WIN) was introduced in the fall of 1992 and on its first anniversary won the ONLINE Product of the Year award at ONLINE/CD-ROM '93. It continues to garner favorable reviews and is now the search method that Westlaw trainers teach first to new searchers. DIALOG's TARGET and Mead's FREESTYLE were announced at ONLINE/CD-ROM '93 to great fanfare, and became publicly available in late 1993 and early 1994.
THE NATURAL ALTERNATIVE
Although each product works somewhat differently, all three offer an alternative to searching with command interfaces and Boolean/proximity operators. (Proximity operators such as specifying within a paragraph or within a specified number of words are an extension of the Boolean AND.) They offer (somewhat) natural language input, with no need for commands or logical operators. This input method is coupled with so-called "associative," "probabilistic" or "statistical" retrieval techniques that provide relevance ranking of search results.
Unlike exact-match Boolean logic systems, where all concepts or terms linked with AND or a proximity operator must be present, relevance retrieval techniques are partial-match systems. They retrieve all documents that contain any words used to represent a concept (as if all words were ORed together). These documents are then run against a mathematical algorithm that weights and ranks the documents. Statistical methods compare, for example, how many times the search words appear in each record with how many times they appear in the database as a whole. Documents that contain many of the search words are given higher weights. If those terms appear relatively less frequently in the database as a whole, the documents that contain them are weighted even more heavily. Relative lengths of each document are taken into account as well. The documents are then sorted by the assigned weights to display first those documents that best match the query. Pritchard-Schoch [1] provides a clear explanation of the history and methods of these techniques, which have been tested for decades. They have been available on smaller online or CD-ROM systems and in software for in-house databases for several years [2].
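The weighting just described can be sketched in a few lines of Python. This is only an illustration of the general statistical idea (term frequency, rarity in the database, and document length), not the formula used by WIN, TARGET, or FREESTYLE, none of which is given in this article. The function names and the particular weighting (a length-normalized frequency multiplied by a logarithmic rarity factor) are assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    return cleaned.split()


def rank(query, documents):
    """Return (score, document_index) pairs, best match first.

    Partial match: a document qualifies if it contains ANY query word,
    as if the words were ORed together. Documents with no query word
    are left out of the result.
    """
    doc_tokens = [tokenize(d) for d in documents]
    n_docs = len(documents)
    query_terms = set(tokenize(query))

    # Document frequency: how many documents contain each query term?
    df = Counter()
    for tokens in doc_tokens:
        for term in set(tokens) & query_terms:
            df[term] += 1

    scored = []
    for i, tokens in enumerate(doc_tokens):
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if counts[term] == 0:
                continue
            tf = counts[term] / len(tokens)        # adjusts for document length
            rarity = math.log(1 + n_docs / df[term])  # rarer in database -> heavier
            score += tf * rarity
        if score > 0:
            scored.append((score, i))
    return sorted(scored, reverse=True)


docs = [
    "Radiation exposure among military personnel was the subject of the hearing.",
    "The city council met on Tuesday. One speaker mentioned radiation in passing, "
    "but most of the long meeting was devoted to parking, zoning, and the budget.",
    "Local sports roundup and weather.",
]
for score, index in rank("radiation exposure military", docs):
    print(round(score, 3), docs[index][:40])
```

The short document that is entirely about the topic outranks the long one that mentions a search word once, and the third document, which contains none of the words, is not retrieved at all.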
The most important question for experienced searchers is this: What will relevance search systems retrieve compared to the tried-and-true Boolean search engines? Should experienced searchers use the new methods? Will your search results be better with one technique or the other? Which method should we teach our end-users? Do all of the new systems achieve the same results?
TARGET
DIALOG officially announced TARGET in October 1993 and made it available for all users in December 1993. Although it was under development off and on for several years, it wasn't until WIN's success that DIALOG decided to get TARGET ready for release. The DIALOG development staff looked at many different relevance methods and tested a variety of algorithms before programming what we see now. TARGET works on all DIALOG databases, but it is best suited for full-text databases or those with lengthy abstracts. These are the databases that rely on free-text searching and often retrieve excessive false drops with conventional searching. Since relevance retrieval compares how many times words occur in a document in relation to the length of each document, entire documents about a topic can be differentiated from those with only a single paragraph or a mention in passing of the desired subject. The most relevant documents should be placed at the top of the set for display first in relevance-ranked retrieval.
TARGET FOR SUBJECT SEARCHING
TARGET works best for text searching. Just as with DIALOG's Boolean system, TARGET defaults to the basic index. Those of you who are regular DIALOG searchers are aware of this distinction in its indexes. Unlike NEXIS and many other online systems, DIALOG maintains separate indexes for subject and non-subject searching. The "basic index" typically includes only words from titles, words from abstracts, words from full text, and words or phrases from descriptors or identifiers--fields which are all considered to represent the subjects of documents. The basic index is searched by default if a searcher doesn't specify any particular field. To search for an author, journal name, corporate source, or other non-subject field in Boolean DIALOG, the searcher must explicitly name that field (e.g., SELECT AU=Asimov, Isaac or SELECT JN=Library Journal). This separation helps avoid false drops in the regular Boolean system, because you will not, for example, retrieve articles authored by Mr. Carpenter when searching for the subject carpenter. TARGET provides two ways of searching these non-subject fields:
1) By putting the prefix search in single quotes (e.g., target 'au=asimov,isaac')
2) By adding an author set created in a Boolean search to a TARGET search (e.g., s au=asimov, isaac; target *s1 'life science' biology zoology)
HOW TO SEARCH WITH TARGET
TARGET can be used in a single database or in multiple databases. Searchers can use a predefined OneSearch grouping or BEGIN in whichever databases they desire. Databases are searched with the CURRENT option by default (current calendar year plus one year) in databases that support the CURRENT feature, but searches can be modified to include other date ranges. If the database does not support CURRENT, TARGET will search the entire database. After beginning in a database or database group, a searcher inputs the word TARGET to get into the TARGET menu search mode. TARGET menu mode provides help, prompts, and some menu choices to guide the novice user through the search process. Figure 1 shows the beginning of a TARGET menu mode search session.
NOT NATURAL LANGUAGE
Even in the novice TARGET mode, TARGET does not claim to support natural language. It does replace the need for Boolean or proximity connectors, but only the actual words or phrases to be searched should be entered. This differs from Westlaw's WIN, since WIN allows a user to enter a natural language statement directly. WIN's natural language interface supports a search statement such as "what is the government's obligation to warn military personnel about their exposure to radiation?" The system then strips out common phrases (e.g., "what is the"), identifies legal phrases matched from a phrase thesaurus, and eliminates stopwords. TARGET requires formalized input of major terms, phrases, and synonyms and does little automatic processing. A TARGET statement might look like this: government? obligation warn? ('military personnel' soldier? sailor?) expos? radiation. Just as with DIALOG Boolean searching, understanding the required syntax is necessary.
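To make the syntax of the sample statement concrete, here is a small Python sketch that splits a TARGET-style statement into its parts. It is not DIALOG's parser. It assumes only what the examples in this article show: a trailing ? marks truncation, single quotes enclose a phrase or prefix search, and a leading asterisk marks a required term. Treating parentheses as a synonym group is an interpretation of the sample statement, and the names parse_statement and term_matches are hypothetical.

```python
import re

# A quoted phrase, a bare word, or a parenthesis; each term may carry a leading "*".
TOKEN = re.compile(r"\*?'[^']*'|\*?[^\s()']+|[()]")


def parse_statement(statement):
    """Split a TARGET-style statement into a list of term dictionaries."""
    terms = []
    in_group = False
    for tok in TOKEN.findall(statement):
        if tok == "(":
            in_group = True
            continue
        if tok == ")":
            in_group = False
            continue
        mandatory = tok.startswith("*")      # assumed marker for a required term
        text = tok.lstrip("*")
        phrase = text.startswith("'")        # single quotes enclose a phrase
        text = text.strip("'")
        truncated = text.endswith("?")       # trailing ? means truncation
        terms.append({
            "text": text.rstrip("?").lower(),
            "phrase": phrase,
            "truncated": truncated,
            "mandatory": mandatory,
            "in_synonym_group": in_group,    # assumption: parentheses group synonyms
        })
    return terms


def term_matches(term, word):
    """True if a single document word satisfies a single-word term."""
    word = word.lower()
    if term["truncated"]:
        return word.startswith(term["text"])
    return word == term["text"]


statement = "government? obligation warn? ('military personnel' soldier? sailor?) expos? radiation"
for term in parse_statement(statement):
    print(term)
```

Under these assumptions the truncated stem expos? would match exposure, exposed, and expose, which is the word-form coverage the searcher has to ask for explicitly.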
TARGET does not have a thesaurus, so the burden of identifying and inputting synonyms is completely on the user, just as it is in DIALOG's Boolean system. Creating a thesaurus that would serve all of the databases on a supermarket system such as DIALOG would be a daunting task. Westlaw has an easier time of it, building a thesaurus of legal terms and phrases. FREESTYLE has a general synonym-type thesaurus. To make natural language search techniques truly useful for novices, databases and systems will have to spend the time and effort to develop and maintain complete multitopic thesauri.
TARGET MODIFICATIONS AND DISPLAY
Search statements in the TARGET menu mode can be modified by choosing the Modify option (but only after a search is run and after the first three items are displayed). Modifications can be made to add or delete terms, to change the designation of a term as a required term, or to change the dates being searched. TARGET statements build a set which can then be used in a Boolean search.
TARGET examines all of the records that contain any of the input words and calculates the likely relevance of each. The formula goes beyond just counting word frequency by comparing how many of the search terms appear in each record with how many times each word appears in the database as a whole. Uncommon words that appear frequently in a document are given more weight. Unequal document lengths are taken into account, as is the proximity of search words.
The resulting document ranking is used as the basis for order of display. Unlike Boolean's reverse chronological display or a user-specified sorting order such as alphabetically by author, relevance ranking displays first the documents that are most likely to answer the user's query. This is good output for browsing until an information need is satisfied, and for those questions where the user doesn't need a comprehensive search. "Relevance" is always ultimately subjective, of course, so there is no guarantee that the fiftieth item displayed will be of less interest in a particular case than the fortieth, or even the first, item.
FREESTYLE
Mead's FREESTYLE is available for both the LEXIS legal service and NEXIS news service. FREESTYLE's performance in LEXIS is best compared to WIN, since LEXIS and Westlaw share many of the same legal databases and compete head-to-head in the legal research market. We chose instead to examine FREESTYLE only in NEXIS, specifically in full-text newspapers. FREESTYLE works on all NEXIS files, either selected individually, selected as NEXIS pre-specified group files, or mixed together in ad hoc groupings by the searcher. After selecting a filename, searchers enter the command .FR to get to FREESTYLE mode. To return to Boolean mode, enter .BOOL.
PLAIN ENGLISH
FREESTYLE is closer to plain English than is TARGET, because it will automatically strip stopwords from an input query. Singulars and plurals are automatically searched (but other word form variations such as past tense and gerunds are not). As with the full NEXIS system, common abbreviations, British/American spelling, and equivalencies (e.g., 4 and four) are also automatic. As with WIN, a FREESTYLE search could be directly entered as "what is the government's obligation to warn military personnel of exposure to radiation?" or using a shorter, more formalized statement as in TARGET. If entered in the former way, "what," "is," "the," "to," "of," and "to" will all be discarded as stop (noise) words. (In plain English searches we tested, effect, services, and information were not discarded as stopwords.) Government's will be searched as government, governments, or government's. The other words will be searched as singulars or plurals, but obligation will not be truncated to oblige, warn to warning, exposure to expose or exposed, etc.
In the first version of FREESTYLE (February 15-May 30, 1994), these variations need to be explicitly input by the searcher (oblige obliged obligation) because truncation, other than automatic plurals, is not supported by FREESTYLE. The NEXIS symbols for user-specified truncation (! and *) did not work in the version of the software we tested.
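The automatic processing described above can be imitated with a short Python sketch. It is not Mead's implementation. The stopword list holds only the five words this article names as discarded, the plural rule is deliberately naive (add or remove a final s), and the function names are hypothetical. The point is to show what is and is not expanded: singulars, plurals, and possessives are, other word forms are not.

```python
# Only the stopwords this article reports as discarded; the real list is longer.
STOPWORDS = {"what", "is", "the", "to", "of"}


def variants(word):
    """Return the singular, plural, and (if entered) possessive forms of a word."""
    possessive = word.endswith("'s")
    base = word[:-2] if possessive else word
    if base.endswith("s") and not base.endswith("ss"):
        base = base[:-1]                  # naive singular: drop a final s
    forms = [base, base + "s"]            # naive plural: add an s
    if possessive:
        forms.append(base + "'s")
    return forms


def process_query(query):
    """Split a plain English query into expanded search words and dropped stopwords."""
    expanded, dropped = {}, []
    for raw in query.lower().split():
        word = raw.strip('.,;:!?"')
        if not word:
            continue
        if word in STOPWORDS:
            dropped.append(word)
        else:
            expanded[word] = variants(word)
    return expanded, dropped


query = "what is the government's obligation to warn military personnel of exposure to radiation?"
expanded, dropped = process_query(query)
print("Discarded:", dropped)
for word, forms in expanded.items():
    print(word, "->", forms)
```

Because no stemming is done, oblige, warning, and exposed appear nowhere in the expansion, so a searcher who wants them has to type them in.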
FREESTYLE THESAURUS
Unlike TARGET, FREESTYLE does have an accompanying thesaurus where searchers can look for synonyms or variant word forms to add to their search. The thesaurus is not invoked automatically; searchers must select the thesaurus option from a Search Options screen and specify which of their search terms they want to check for synonyms [4].
SEARCH OPTIONS/RESULTS
After inputting a search statement but before FREESTYLE runs the search, a Search Options screen is displayed. Search Options include viewing the thesaurus, editing the search statement, or running the search as is. Edit choices include adding or deleting search terms or phrases, designating terms as mandatory, or adding restrictions such as date, byline, etc. (If date restrictions are not selected, FREESTYLE defaults to searching the full file. Date edits allow users to specify a specific date or date range.) If more than one edit is desired, the process can take a while. The Search Options screen must be entered for each modification, and each modification must be made individually. Command stacking provides a shortcut through the restrictions and allows users to enter more than one option at a time.
Like the asterisk (*) in TARGET, designating a term as mandatory means that the term must be present in any documents retrieved and ranked by FREESTYLE. It adds more precision to the search by combining a Boolean-like search technique with relevance ranking. However, in FREESTYLE the mandatory designation must be made after an initial search statement is entered, and the desired term must be retyped after the mandatory option is selected.
Since NEXIS has one large inverted index, rather than a subject-related basic index and non-subject field additional indexes like DIALOG, authors (bylines) and publication years will be searched if they are entered as part of the initial search statement. Searching for isaac asimov as a byline in FREESTYLE can be done just by entering his name, but documents that include mentions of Isaac Asimov in the text or as a subject will be retrieved in addition to articles written by him. To gain more precision by searching for him only as a byline, use the Restrictions choice on the Search Options screen, followed by selecting byline. This can only be done if you have already entered a basic search query, however. You cannot select an author alone.
When the search is run, a Search Results screen is displayed. The screen reports any stopwords that were input in the search statement and any phrases found in the phrase dictionary. It summarizes which terms were designated mandatory and any restrictions applied.
WHERE AND WHY
While DIALOG has included information about the occurrences of words and relevance ranking score as a display option with each record, Mead has chosen to make this diagnostic information part of two separate commands. The WHERE and WHY commands are unique to Mead. WHERE shows which documents contain each of the search terms, and WHY shows the level of importance assigned to each term by the system. If you have changed the display to more than 25 documents, WHERE will only display information about the first 25 documents retrieved in any FREESTYLE search. This will be changed in the new release expected in June 1994. WHERE and WHY have been favorably received, especially by experienced searchers [5]. WHERE helps searchers determine which documents to browse according to their own idiosyncratic view of relevance. WHY helps an experienced searcher determine if a new strategy should be used, if a Boolean search might get better results, or even if they are in the wrong database.
COMPARING TARGET AND FREESTYLE
The main purpose of this article is to compare DIALOG Boolean searching with DIALOG TARGET and NEXIS Boolean with NEXIS FREESTYLE. We did not set out to compare TARGET and FREESTYLE head-to-head, although some comparison is obvious. Most of the differences in the approaches taken by the two systems reflect their differing basic philosophies.
TARGET puts the searcher more in control and does very little automatically. FREESTYLE, on the other hand, does some things automatically and attempts to lead the searcher by the hand a bit more. This is consistent with the different focuses of these systems--experienced searchers for DIALOG and novice end-users for NEXIS.
COMPARING RELEVANCE AND BOOLEAN
An in-depth comparison of these Boolean search engines with relevance search techniques requires testing real questions and searches. This should be done over time by many searchers--we have just scratched the surface. We gathered questions from reference librarians in four libraries and selected six questions to test.
We did all of the searches in the same newspapers, in an ad hoc grouping of the Los Angeles Times, Boston Globe, and Washington Post papers for 1993-1994.
QUESTION #1. What can you find out about EMFs? [Note: EMF = electromagnetic field]
QUESTION #2. Find any mention of Hopis.
QUESTION #3. What is the effect of PCBs on fish? [Note: PCB = polychlorinated biphenyl]
QUESTION #4. Is there abusive behavior and battering in lesbian relationships?
QUESTION #5. Should the state provide emergency medical services for illegal immigrants?
TEST SEARCH RESULTS
On DIALOG, an average precision ratio (relevant retrieved/all retrieved) of 56% was achieved by TARGET, compared to 61% by Boolean. NEXIS results were similar, with 53% for FREESTYLE and 64% for Boolean.
The better overall precision with Boolean should be contrasted with the greater number of total documents retrieved and, at times, greater number of relevant documents, obtained by relevance searching.
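For readers who want to reproduce this kind of figure from their own searches, the precision ratio is simple arithmetic. The Python sketch below assumes that the average is a plain mean of the per-question ratios, which this excerpt does not state, and the counts in the example are made up for illustration. They are not the article's data.

```python
def precision(relevant_retrieved, total_retrieved):
    """Precision ratio: relevant documents retrieved / all documents retrieved."""
    if total_retrieved == 0:
        return 0.0
    return relevant_retrieved / total_retrieved


def mean_precision(searches):
    """Plain mean of per-search precision ratios.

    `searches` is a list of (relevant_retrieved, total_retrieved) pairs,
    one pair per test question.
    """
    ratios = [precision(rel, total) for rel, total in searches]
    return sum(ratios) / len(ratios)


# Hypothetical counts for two questions on one system.
example = [(5, 10), (3, 4)]
print(f"{mean_precision(example):.1%}")
```

Precision says nothing about how many relevant documents were missed, which is why the totals retrieved by each method matter as well.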
REFERENCES
[1] Pritchard-Schoch, Teresa. "Natural Language Comes of Age." ONLINE 17, No. 3 (May 1993): pp. 33-43.
[2] Tenopir, Carol. "The New Generation of Online Search Software." Library Journal 117 (October 1, 1993): pp. 67-68.
[3] WIN is not the first commercially available online system to go beyond Boolean. That honor probably belongs to Congressional Quarterly's Washington Alert, which has used the Personal Librarian search engine since 1989.
[4] The FREESTYLE thesaurus is a synonym list thesaurus such as Roget's, not the kind of thesaurus defined in the ANSI (American National Standards Institute) or ISO (International Organization for Standardization) standards for use with indexing. It lists only synonyms and word form variants, and does not specify term hierarchies, such as broader terms, narrower terms, etc.
[5] Bjorner, Susanne N. "Output Options: The .WHERE and .WHY of FREESTYLE." ONLINE 18, No. 2 (March 1994): pp. 88-91.
>>>>>>>>>>>> Your Assignment <<<<<<<<<<<<<<
Suppose your boss reads the foregoing excerpt from the Tenopir and Cahn article and shouts "Why didn't they compare them head to head!?! Here in the Pacific Northwest, we need to know which works best with The Seattle Times!!"
Before you know it, you have been assigned the job of comparing Target and FreeStyle in a "head to head" competition with The Seattle Times. You are to write a short report detailing how you compared them and your recommendations regarding their use with The Seattle Times.
Target Strategy:
1. Choose one of the five questions used by Tenopir and Cahn.
2. Think about what terms you will use in your "natural language" search and, perhaps more importantly, which terms you will demand be present (i.e., the mandatory terms).
3. Search Dialog's file of The Seattle Times first. Begin file 707 and issue the command "target" to get into the Target menu mode:
? b 707
? target
[Note the time frame used by Dialog's CURRENT feature. This is crucial because you will want to restrict the FreeStyle search to the same time frame.]
4. Do your search. Browse the results and capture them electronically so that you can compare the Target results with the FreeStyle results.
FreeStyle Strategy:
1. Do the Dialog Target search first!
2. Log on to Nexis.
3. Choose the library
REGNWS
4. Choose the file for The Seattle Times
SEATTM
5. Set the searching mode to FreeStyle
.fr
6. Enter the same search that you ran in Dialog's Target. You will have to "translate" from Target to FreeStyle to make sure that the two searches are equivalent.
Note that the Nexis interface has you enter your search terms first and, after you press the Enter key, presents you with the following menu:
<=1> Edit Search Description
<=2> Enter/edit Mandatory Terms
<=3> Enter/edit Restrictions (e.g. date)
<=4> Thesaurus
<=5> Change number of documents
7. Before you command Nexis to do the search, set two restrictions:
   1. Set the document restriction to 50. [Dialog's Target automatically sets a document restriction of 50.]
   2. Set the date restriction to the same range of dates used by the Dialog search. For example, suppose you noted that Dialog's CURRENT feature used the time frame 1994 - 1995. Then set the date restriction on FreeStyle to
   AFT 01/01/94
8. Do the search and capture the necessary information electronically.
Report
Write a short report detailing your search on both Target and FreeStyle, and compare the search results. Decide which system is "better" for your search. If your results are inconclusive, consider the effects on your results if you had altered the search terms (i.e., changed the mandatory nature of some terms). You may also want to broaden your results by including the results of more than one topic.
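Once both result lists have been captured, a small script can take some of the tedium out of the comparison. The Python sketch below is one possible approach, not a required method. It assumes you have reduced each captured document to a key that is the same on both systems (for example, a lowercased headline plus the publication date), that each list is in the order the system displayed it, and that you have judged relevance yourself. All names and the sample data are hypothetical.

```python
def precision_at(ranked_keys, relevant, k):
    """Share of the first k displayed documents that you judged relevant."""
    top = ranked_keys[:k]
    if not top:
        return 0.0
    return sum(1 for key in top if key in relevant) / len(top)


def compare(target_ranked, freestyle_ranked, relevant, k=10):
    """Summarize overlap and top-of-list precision for the two result lists."""
    target_set, freestyle_set = set(target_ranked), set(freestyle_ranked)
    return {
        "retrieved_by_both": sorted(target_set & freestyle_set),
        "target_only": sorted(target_set - freestyle_set),
        "freestyle_only": sorted(freestyle_set - target_set),
        "target_precision_at_k": precision_at(target_ranked, relevant, k),
        "freestyle_precision_at_k": precision_at(freestyle_ranked, relevant, k),
    }


# Hypothetical keys standing in for captured Seattle Times articles.
target_results = ["a", "b", "c", "d"]
freestyle_results = ["b", "e", "a"]
judged_relevant = {"a", "b", "e"}

summary = compare(target_results, freestyle_results, judged_relevant, k=3)
for name, value in summary.items():
    print(name, value)
```

Looking at the documents found by only one system is often more informative than the ratios alone, since it shows which terms or mandatory designations made the difference.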