The Anti-Thesaurus: A Proposal For Improving Internet Search While Reducing Unnecessary Traffic Loads

Nicholas Carroll
Date: November 19, 2001
Modified: N/A (11/26/01 -- I will be posting an expansion of this paper in the next week or two.)

Summary

In the continual struggle between search engine administrators, index spammers, and the chaos that underlies knowledge classification, we have endless tools for "increasing relevance" of search returns, ranging from the much-ballyhooed and misunderstood "meta keywords" tag to complex algorithms that are still far from perfecting artificial intelligence.

Proposal: there should be a metadata standard allowing webmasters to manually decrease the relevance of their pages for specific search terms and phrases.

=================================

I operate several web sites. Among the many search strings bringing visitors in, these are of no use to either the searcher or me:

    Victor mousetraps (we don't sell mousetraps, Victor or otherwise)
    Charles Ponzi (there's a superb bio on the Net; why come here for two lines?)
    Hannibal (we are not an info site on Hannibal)
    Matteo Ricci (he's listed in a bibliography; there is no info to speak of)
    Smalltalk (we have tech info on only one Smalltalk application)

I am faintly embarrassed by drawing in these searchers when I have no useful information for them. Their time was wasted – needlessly. Furthermore, they hog my bandwidth and clog my log files with useless data.

(With better search skills, they might never have arrived at my pages. But I'm not a member of the "they're so stoopid" school of thought. Anyway, there are plenty of genius-level people who just aren't wired correctly for search.)

Then there are the searchers I would just as soon not know about at all, like the ones looking for:

    stalking on the Internet

The phrase that was bringing them in was "Marketing Myths Stalking the Internet". Potentially even a Google exact-phrase search would have led a searcher to that page, since Google treats "on" and "the" as stopwords, generally ignoring them even if they are within quote marks.

A metadata tag to eliminate such irrelevant searches would be quite useful. E.g.:

    <META NAME="nonwords" CONTENT="victor, mousetraps, ponzi, hannibal, ricci, stalking">

(This is a hypothetical tag. At present it does nothing. Don't add it to your pages.)

Such a tag would eliminate numerous hits on my servers. I couldn't add "smalltalk" to the list, since that particular page actually does give Smalltalk information, even though it is secondary to the page's subject.

(From an information science point of view, the present HTML META NAME="keywords" tag is, very loosely speaking, a thesaurus, since it provides a place for alternate spellings of words, misspellings, and related words – as well as words with similar meaning. That index spammers have widely abused it does not change that original intent. Thus my term, the "anti-thesaurus".)
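To make the mechanics concrete, here is a rough sketch (in Python, and purely hypothetical, since no search engine reads such a tag today) of how a crawler might honor a nonwords tag while indexing a page. The function names and the sample page are invented for illustration only.

    # Hypothetical sketch only: no real crawler or search engine works this way.
    import re

    def extract_nonwords(html):
        """Collect the terms listed in a <META NAME="nonwords"> tag, if one exists."""
        match = re.search(
            r'<meta\s+name=["\']nonwords["\']\s+content=["\']([^"\']*)["\']',
            html, re.IGNORECASE)
        if not match:
            return set()
        return {term.strip().lower() for term in match.group(1).split(",") if term.strip()}

    def index_terms(html):
        """Index the words on a page, minus whatever the author asked to withhold.
        (A real indexer would strip markup first; this sketch skips that step.)"""
        nonwords = extract_nonwords(html)
        words = {w.lower() for w in re.findall(r"[A-Za-z]+", html)}
        return words - nonwords

    page = """<html><head>
    <META NAME="nonwords" CONTENT="victor, mousetraps, ponzi">
    </head><body>Victor mousetraps are mentioned here only in passing.</body></html>"""

    print(index_terms(page))  # "victor", "mousetraps", and "ponzi" never reach the index

The point is simply that honoring the tag costs the crawler one extra lookup per page; everything else about indexing stays the same.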
Will They Use It?

There are human limitations to this. For one, many webmasters won't hear of it, won't understand it, or just won't bother. After all, how many web sites properly use the robots.txt exclusion standard? The answer is: enough to make it worthwhile.

Saving storage alone is of interest. If nothing else, every unwanted page I visit snarfs a small chunk of my disk storage. Many pages add up to a lot of snarfing. Techies have assured me that storage will soon cost $5 per terabyte at CompUSA – and that will take care of storage problems.

Perhaps. On the other hand, millions of coders keep cranking out millions of lines of code. Billions of non-coders keep cranking out papers and email. Roger Gregory, who was project leader for Xanadu Green – and was giving thought to the whole planet's storage – saw it somewhat differently: "We concluded that there will never be enough storage."

Yet the big payoff would be in reducing transmission loads. Wireless in particular has some rough years ahead, and I for one really don't want to download useless web pages at 19,600 bps.

Also, returning to the robots.txt standard: it may be underused simply because it is a security breach (the file openly lists URLs that webmasters do not want visible through search engines). It is possible that many more webmasters would be using it properly, if not for that security problem. (Leaving a page out of the robots.txt file, a la "security by obscurity", is admittedly no guarantee of security. Search engine spiders could find the URL in another web site's unprotected logs and crawl it anyway. But many webmasters consider that risk preferable to blatantly listing the URL right in robots.txt for anyone at all to see.)

An Anti-Thesaurus is a much more limited security risk. There is little to be learned from knowing what sort of traffic a site's webmaster does not want. (Yes, one might suspect many things by viewing the tag's keywords. But it's a pretty big jump from seeing the keywords a webmaster added to a nonwords tag to predicting corporate strategy.)

Will They Misuse and Abuse It?

It would unquestionably be misused by some percentage of webmasters. Few webmasters are expert in search, and many would no doubt load a nonwords tag with far more words than are actually needed to eliminate the unwanted traffic. Some would accidentally knock pages down the search engine's listings when the pages were in fact correctly ranked as is.

I don't see any obvious way to abuse such a tag on any major scale. That is, I can see plenty of ways to get cute – just as webmasters used to spend hundreds of largely wasted hours trying to manipulate search engines through the META KEYWORDS tag. (But I haven't done any serious experimentation to look for major security flaws. Feedback welcome.)

The Load On the Search Engines

Not much. Search engines generally store web page data in table format. Nonwords means the addition of one field. As noted in the above examples, a huge percentage of irrelevant search returns can be eliminated by withholding a single word from the searched data. What search engines might lose in storage, they would more than gain in quality and speed.

Notes

1. Rather than "nonwords", I was tempted to use the term "exwords", as a contraction of "excluded words". Unfortunately, when spoken, it can be heard as "x words", implying variable words. Not that I know what a "variable word" is. But it would be bound to confuse technical people.

2. The "anti-thesaurus" should not be confused with "stop lists," which in information science usually refer to lists of "stop words" – common words such as "the" or "and" that are excluded completely from the search protocol for all searches. If one wants to quibble, I suppose the anti-thesaurus could be called a "content-provider-definable stop word list." But I'd just as soon leave "stop words" to the information retrieval professionals.

Please send comments to Nicholas Carroll
Email: ncarroll@hastingsresearch.com
http://www.hastingsresearch.com/net/06-anti-thesaurus.shtml

© 1999-2001 Hastings Research, Inc. All rights reserved.