Internet search result probabilities: Heaps' law and word associativity

Jonathan C. Lansey, Bruce Bukiet

Research output: Contribution to journal › Article › peer-review

10 Scopus citations

Abstract

We study the number of internet search results returned from multi-word queries based on the number of results returned when each word is searched for individually. We derive a model to describe search result values for multi-word queries using the total number of pages indexed by Google and by applying the Zipf power law to the words per page distribution on the internet and Heaps' law for unique word counts. Based on data from 351 word pairs, each with exactly one hit when searched for together, and a Zipf law coefficient determined in other studies, we approximate the Heaps' law coefficient for the indexed World Wide Web (about 8 billion pages) to be β = 0.52. Previous studies used under 20,000 pages. We demonstrate through examples how the model can be used to automatically analyse the relatedness of word pairs, assigning each a value we call "strength of associativity". We demonstrate the validity of our method with word triplets and through two experiments conducted 8 months apart. We then use our model to compare the index sizes of competing search giants Yahoo and Google.
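
As a rough illustration of two quantities the abstract mentions, the sketch below evaluates Heaps' law in its usual form, V = K * n^β, with the reported β = 0.52, and computes a naive independence baseline for how many pages would contain two words together. This is not the model derived in the paper: the function names, the constant K, and the independence assumption are all hypothetical and chosen only for illustration.

```python
def heaps_vocabulary(n_tokens: float, k: float, beta: float = 0.52) -> float:
    """Heaps' law in its standard form: V = K * n**beta.

    beta defaults to the 0.52 reported in the abstract; k is an
    arbitrary illustrative constant, not a value from the paper.
    """
    if n_tokens < 0:
        raise ValueError("n_tokens must be non-negative")
    return k * n_tokens ** beta


def expected_joint_hits(hits_a: float, hits_b: float, total_pages: float) -> float:
    """Naive baseline: expected pages containing both words if the two
    words occurred independently across pages. This is an assumption
    made for illustration, not the paper's model.
    """
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    return hits_a * hits_b / total_pages


if __name__ == "__main__":
    # Doubling the text size scales the vocabulary by 2**beta under Heaps' law.
    growth = heaps_vocabulary(2_000_000, k=10.0) / heaps_vocabulary(1_000_000, k=10.0)
    print(f"vocabulary growth when text doubles: {growth:.3f}")

    # Hypothetical hit counts against an index of about 8 billion pages.
    print(expected_joint_hits(1_000, 2_000, 8e9))
```

Under this baseline, a word pair whose observed joint hit count is far above the independence estimate would look strongly related, which is the kind of comparison the abstract's "strength of associativity" is meant to capture; the paper's actual definition is not given here.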

Original language: English (US)
Pages (from-to): 40-66
Number of pages: 27
Journal: Journal of Quantitative Linguistics
Volume: 16
Issue number: 1
State: Published - Feb 1 2009

All Science Journal Classification (ASJC) codes

  • Language and Linguistics
  • Linguistics and Language
