The Anti-Thesaurus: A Proposal For Improving Internet Search While Reducing Unnecessary Traffic Loads

Nicholas Carroll
Date: November 19, 2001
Modified: N/A (11/26/01 -- I will be posting an expansion of this paper in the next week or two.)

Summary

In the continual struggle between search engine administrators, index spammers, and the chaos that underlies knowledge classification, we have endless tools for "increasing relevance" of search returns, ranging from the much-ballyhooed and misunderstood "meta keywords" tag to complex algorithms that are still far from perfecting artificial intelligence.

Proposal: there should be a metadata standard allowing webmasters to manually decrease the relevance of their pages for specific search terms and phrases.

=================================

I operate several web sites. Among the many search strings bringing visitors in, these are of no use to either the searcher or me:

    Victor mousetraps (we don't sell mousetraps, Victor or otherwise)
    Charles Ponzi (there's a superb bio on the Net; why come here for two lines?)
    Hannibal (we are not an info site on Hannibal)
    Matteo Ricci (he's listed in a bibliography; there is no info to speak of)
    Smalltalk (we have tech info on only one Smalltalk application)

I am faintly embarrassed by drawing in these searchers when I have no useful information for them. Their time was wasted – needlessly. Furthermore, they hog my bandwidth and clog my log files with useless data.

(With better search skills, they might never have arrived at my pages. But I'm not a member of the "they're so stoopid" school of thought. Anyway, there are plenty of genius-level people who just aren't wired correctly for search.)

Then there are the searchers I would just as soon not know about at all, like the ones looking for:

    stalking on the Internet

The phrase that was bringing them in was "Marketing Myths Stalking the Internet". Potentially even a Google exact-phrase search would have led a searcher to that page, since Google treats "on" and "the" as stopwords, generally ignoring them even if they are within quote marks.

A metadata tag to eliminate such irrelevant searches would be quite useful. E.g.:

    <META NAME="nonwords" CONTENT="victor, mousetraps, ponzi, hannibal, ricci, stalking">

(This is a hypothetical tag. At present it does nothing. Don't add it to your pages.)

Such a tag would eliminate numerous hits on my servers. I couldn't add "smalltalk" to the list, since that particular page actually does give Smalltalk information, even though it is secondary to the page's subject.

(From an information science point of view, the present HTML META NAME="keywords" tag is, very loosely speaking, a thesaurus, since it provides a place for alternate spellings of words, misspellings, and related words – as well as words with similar meaning. That index spammers have widely abused it does not change that original intent. Thus my term, the "anti-thesaurus".)
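To make the mechanics concrete, here is a rough sketch (in Python, and purely hypothetical, since no search engine reads such a tag today) of how a crawler might honor a nonwords tag while indexing a page. The function names and the sample page are invented for illustration only.

    # Hypothetical sketch only: no real crawler or search engine works this way.
    import re

    def extract_nonwords(html):
        """Collect the terms listed in a <META NAME="nonwords"> tag, if one exists."""
        match = re.search(
            r'<meta\s+name=["\']nonwords["\']\s+content=["\']([^"\']*)["\']',
            html, re.IGNORECASE)
        if not match:
            return set()
        return {term.strip().lower() for term in match.group(1).split(",") if term.strip()}

    def index_terms(html):
        """Index the words on a page, minus whatever the author asked to withhold.
        (A real indexer would strip markup first; this sketch skips that step.)"""
        nonwords = extract_nonwords(html)
        words = {w.lower() for w in re.findall(r"[A-Za-z]+", html)}
        return words - nonwords

    page = """<html><head>
    <META NAME="nonwords" CONTENT="victor, mousetraps, ponzi">
    </head><body>Victor mousetraps are mentioned here only in passing.</body></html>"""

    print(index_terms(page))  # "victor", "mousetraps", and "ponzi" never reach the index

The point is simply that honoring the tag costs the crawler one extra lookup per page; everything else about indexing stays the same.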
Will They Use It?

There are human limitations to this. For one, many webmasters won't hear of it, won't understand it, or just won't bother. After all, how many web sites properly use the robots.txt exclusion standard? The answer is: enough to make it worthwhile.

Saving storage alone is of interest. If nothing else, every unwanted page I visit snarfs a small chunk of my disk storage. Many pages add up to a lot of snarfing. Techies have assured me that storage will soon cost $5 per terabyte at CompUSA – and that will take care of storage problems.

Perhaps. On the other hand, millions of coders keep cranking out millions of lines of code. Billions of non-coders keep cranking out papers and email. Roger Gregory, who was project leader for Xanadu Green – and was giving thought to the whole planet's storage – saw it somewhat differently: "We concluded that there will never be enough storage."

Yet the big payoff would be in reducing transmission loads. Wireless in particular has some rough years ahead, and I for one really don't want to download useless web pages at 19,600 bps.

Also, returning to the robots.txt standard: it may be underused simply because it is a security breach (the file openly lists URLs that webmasters do not want visible through search engines). It is possible that many more webmasters would be using it properly, if not for that security problem. (Leaving a page out of the robots.txt file, a la "security by obscurity", is admittedly no guarantee of security. Search engine spiders could find the URL in another web site's unprotected logs and crawl it anyway. But many webmasters consider that risk preferable to blatantly listing the URL right in robots.txt for anyone at all to see.)

An Anti-Thesaurus is a much more limited security risk. There is little to be learned from knowing what sort of traffic a site's webmaster does not want. (Yes, one might suspect many things by viewing the tag's keywords. But it's a pretty big jump from seeing the keywords a webmaster added to a nonwords tag to predicting corporate strategy.)

Will They Misuse and Abuse It?

It would unquestionably be misused by some percentage of webmasters. Few webmasters are expert in search, and many would no doubt load a nonwords tag with far more words than are actually needed to eliminate the unwanted traffic. Some would accidentally knock pages down the search engine's listings when the pages were in fact correctly ranked as is.

I don't see any obvious way to abuse such a tag on any major scale. That is, I can see plenty of ways to get cute – just as webmasters used to spend hundreds of largely wasted hours trying to manipulate search engines through the META KEYWORDS tag. (But I haven't done any serious experimentation to look for major security flaws. Feedback welcome.)

The Load On the Search Engines

Not much. Search engines generally store web page data in table format. Nonwords means the addition of one field. As noted in the above examples, a huge percentage of irrelevant search returns can be eliminated by withholding a single word from the searched data. What search engines might lose in storage, they would more than gain in quality and speed.

Notes

1. Rather than "nonwords", I was tempted to use the term "exwords", as a contraction of "excluded words". Unfortunately, when spoken, it can be heard as "x words", implying variable words. Not that I know what a "variable word" is. But it would be bound to confuse technical people.

2. The "anti-thesaurus" should not be confused with "stop lists," which in information science usually refer to lists of "stop words" – common words such as "the" or "and" that are excluded completely from the search protocol for all searches. If one wants to quibble, I suppose the anti-thesaurus could be called a "content-provider-definable stop word list." But I'd just as soon leave "stop words" to the information retrieval professionals.

Please send comments to Nicholas Carroll
Email: ncarroll@hastingsresearch.com
http://www.hastingsresearch.com/net/06-anti-thesaurus.shtml

© 1999-2001 Hastings Research, Inc. All rights reserved.