Searching for information

With millions of web pages available, it is virtually impossible to find the one you need without the help of a search engine. Search engines operate by retrieving ("spidering") all pages from a site and indexing the words therein in a database. A query is then compared against the database, and links to matching pages are presented to the user.

Matching pages to a query

A difficult task is determining the relevance of the matching pages to a query. Since no standard exists to indicate the relevance of a page (and any such standard would be doomed to fail through abuse by unscrupulous website owners wanting to be number one for any query), search engines employ various heuristics with varying degrees of success. A heuristic with a high probability of success is clearly desirable.

Some known heuristics are:

One heuristic which appears to work exceptionally well is employed by Google. This search engine computes a so-called PageRank (patent pending, not published) for every page it indexes, and orders the results by this PageRank.

The PageRank is computed based on two assumptions. First, a page to which many pages link should get a higher ranking. Second, a link from a page with a high ranking is worth more than a link from a low-ranked page. The importance of a page is further determined from anchor text, font size, location in the document, and proximity. All of these factors are combined to compute the overall relevance of a page. This computation corresponds to computing the principal eigenvector of the normalized link matrix of the web (which is not for the faint of heart -- see the PageRank paper for details). The results are surprisingly accurate.
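The eigenvector computation above can be approximated with straightforward power iteration. The following is a minimal sketch of the two assumptions at work; the damping factor, iteration count, and tiny example web are illustrative choices, not Google's actual parameters.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # every page gets a small base rank, plus a share of the rank
        # of every page that links to it
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:
                continue
            share = damping * rank[page] / len(outgoing)
            for target in outgoing:
                new_rank[target] += share
        rank = new_rank
    return rank

# toy web: a links to b and c, b links to c, c links back to a
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(web)
# c, with two incoming links, ends up ranked above b, which has one
# half-weight link from a
```

Note how the second assumption falls out naturally: a link from a high-ranked page carries a larger `share`, without being treated as a special case.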

Another method for ranking documents is used by Altavista (US 6,112,203). Using this method, a graph is created of all the indexed documents based on the way they link to each other. The graph is analyzed, and documents are grouped based on the fact that they have a large number of links in common. Such a group is then related to a particular topic, which can be found by scanning for common words and themes. In each group, each document is assigned a weight based on its relevance to the topic. When the user then enters a query, only the most relevant documents from groups related to the query are presented.

When the documents in question comprise music, indexing and matching is much more difficult than with textual documents. To allow searching, identifiers can be embedded in pieces of music using watermarking techniques. This is known as content identification. A server can now automatically detect the identifier and perform a database lookup to determine the title, artist and so on of the piece of music in question. However, this has the disadvantage that the watermark must be inserted beforehand.

As an alternative, music recognition technologies have been developed. These technologies attempt to detect characteristic features in the piece of music and compute a signature or hash based on these features. This signature is then stored in the database. When a piece of music is received, its signature is computed and compared against the signatures stored in the database. This way, no identifying information needs to be present in the music itself, although reliably detecting characteristic features is technically very difficult. Further, the database lookup can be very time-consuming.
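The store-and-look-up scheme can be sketched as follows. The "features" here are just coarsely quantized energy averages over chunks of a sample array; real systems use far more robust acoustic features, and the quantization level is an illustrative assumption. The point of the quantization is exactly the robustness issue mentioned above: small distortions should still map to the same signature.

```python
def signature(samples, buckets=4):
    """Reduce a list of sample values to a coarse, quantized signature."""
    size = max(1, len(samples) // buckets)
    chunks = [samples[i:i + size] for i in range(0, len(samples), size)]
    # quantize each chunk's average to 4 levels, so small distortions
    # still yield the same signature (the level count is an assumption)
    return tuple(int((sum(c) / len(c)) * 4) for c in chunks[:buckets])

database = {}  # signature -> identifying information

def register(samples, title, artist):
    database[signature(samples)] = (title, artist)

def identify(samples):
    return database.get(signature(samples))

song = [0.1, 0.2, 0.15, 0.8, 0.9, 0.85, 0.2, 0.1]
register(song, "Example Song", "Example Artist")
# a lightly distorted copy still quantizes to the same signature
distorted = [s + 0.01 for s in song]
match = identify(distorted)
```

A dictionary lookup like this is constant time, but only finds exact signature matches; the time-consuming lookup mentioned above arises in real systems, where approximate matching over millions of signatures is needed.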

Watermarks and hashing technologies enable many new business opportunities. For example, a service could be offered in which music is recognized for a fee. The service could operate by letting a consumer call a special phone number to transmit a piece of music to a server using the microphone in his mobile telephone. The server computes the hash or detects the watermark. Using this information, the server then performs a database lookup to extract identifying information and sends that information to the consumer, for example as an SMS message.

Search result clustering

When presenting the results of a query to the user, it is important to avoid cluttering the presentation with multiple documents that are (virtually) the same. To this end, documents that match the query are clustered. One simple way to cluster is to group documents by site or common URL prefix. When presenting results, only a limited number of documents having a common URL prefix are included, preferably with the option to show more results from that site. This approach is based on the assumption that documents on one site are probably very much related, and the user will be able to find the most relevant one by starting at the one shown.
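The grouping-by-site approach can be sketched in a few lines, here using the hostname as the common prefix; the cap of two results per site is an illustrative choice.

```python
from urllib.parse import urlparse

def cluster_by_site(results, per_site=2):
    """results: ranked list of URLs; keep at most per_site per hostname."""
    shown = []
    counts = {}
    for url in results:
        host = urlparse(url).netloc
        counts[host] = counts.get(host, 0) + 1
        if counts[host] <= per_site:
            shown.append(url)
        # further hits from this host would sit behind a
        # "more results from this site" link
    return shown

results = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/c",
    "http://other.org/x",
]
shown = cluster_by_site(results)
# the third example.com page is suppressed
```

Because the input list is already ranked, taking the first few URLs per host automatically keeps the most relevant document from each site, matching the assumption described above.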

This approach does not solve the problem of mirrors, where the same document is made available from multiple independent sites, possibly in different languages. In such a case, the document will be listed multiple times, once for each mirror. A text-based comparison can be employed to find these mirrors, but this is not always accurate. Often, mirrored documents contain extra text referring to the original site, or an indication that they are a mirror. This extra text can easily cause an automated comparison to fail. And of course a translation into another language is always marked as a different document.

Altavista solved this problem by comparing the hyperlinks in both documents (US 6,138,113). If both documents link to the exact same set of other pages, the documents are considered to be the same and only one of them is shown.
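The core test is simply set equality over the outgoing links; this sketch shows only that core idea, while the patented method is of course more involved.

```python
def are_mirrors(links_a, links_b):
    """Two documents linking to exactly the same set of pages
    are treated as copies of each other."""
    return set(links_a) == set(links_b)

original = ["http://a.com/", "http://b.com/", "http://c.com/"]
mirror = ["http://c.com/", "http://a.com/", "http://b.com/"]  # same set
other = ["http://a.com/", "http://d.com/"]
```

Note that link order and duplicates do not matter, and the comparison is unaffected by the extra "this is a mirror" text that defeats text-based comparisons.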

Another clustering algorithm compares the frequency of words in both versions, and clusters documents if many words have almost the same frequency in every document. A dictionary can be employed to roughly translate documents in different languages before comparing frequencies. Since frequency counting is typically only done after words have been converted to their normal form ("taught" is converted to "teach", for instance), a rough dictionary-based translation is sufficient for this purpose.
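The frequency comparison can be sketched as follows, under stated assumptions: words are lowercased and naively "normalized" by stripping a few suffixes (a crude stand-in for real stemming or the dictionary-based translation mentioned above), and two documents cluster together when their frequency profiles have high cosine similarity.

```python
import math
import re
from collections import Counter

def normalize(word):
    # crude stand-in for converting words to their normal form
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def frequencies(text):
    words = [normalize(w) for w in re.findall(r"[a-z]+", text.lower())]
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}

def similarity(freq_a, freq_b):
    """Cosine similarity between two word-frequency profiles."""
    dot = sum(freq_a[w] * freq_b[w] for w in set(freq_a) & set(freq_b))
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    return dot / (norm_a * norm_b)

doc1 = "the crawler retrieves pages and indexes the words in the pages"
doc2 = "the crawler retrieved a page and indexed the words in that page"
doc3 = "watermarks identify music by embedding data in the audio signal"
```

Because "retrieves" and "retrieved" both normalize to "retrieve", the two rewordings of the same content score as far more similar to each other than to the unrelated third document.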

The relatedness of two documents can also be derived from the fact that they relate to the same topic (as determined using summarizing algorithms), and link to each other, or both are linked to from a common third document.

These clustering algorithms can also be used to generate a listing of documents that are similar to a chosen one. Many search engines offer with each result the option to "show similar documents". This allows a user to easily narrow down his search.

Solutions to common search engine problems

Some known solutions to common problems are:

Value-enhancing features

Further, it is desirable to offer features that make the search engine more attractive to use. Some known features are:

Downloading documents

A crawler is a program that retrieves webpages, commonly for use by a search engine or a web cache. Basically, the crawler takes an initial webpage, extracts all the URLs in it, and retrieves those. It then repeats this procedure for every webpage retrieved. It is clear that this procedure generates a large amount of network traffic, and cannot succeed in retrieving every document on the WWW. And by the time the crawler has retrieved a substantial portion of the WWW, the pages it retrieved first will already be out of date.

Further, it is desirable that a crawler does not overload a single site by repeatedly downloading webpages at high speed.
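The procedure above is a breadth-first traversal of the link graph. In this minimal sketch, fetching is faked with an in-memory "web" so the example is self-contained; a real crawler would fetch over HTTP and, as noted, rate-limit its requests per host.

```python
from collections import deque

# stand-in for the real WWW: page -> list of URLs it contains
FAKE_WEB = {
    "http://a.com/": ["http://a.com/about", "http://b.com/"],
    "http://a.com/about": ["http://a.com/"],
    "http://b.com/": ["http://a.com/"],
}

def crawl(start, max_pages=100):
    seen = {start}
    queue = deque([start])
    retrieved = []
    while queue and len(retrieved) < max_pages:
        url = queue.popleft()
        links = FAKE_WEB.get(url, [])   # "retrieve" the page
        retrieved.append(url)
        for link in links:              # extract and enqueue new URLs
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return retrieved

pages = crawl("http://a.com/")
```

The `seen` set prevents the crawler from fetching the same page twice despite the cycles in the link graph, and `max_pages` acknowledges that retrieving everything is infeasible.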

Determination of document topics

When presenting the results of a query, it is desirable to show a short summary of what each result is about. The most common way to do this is to show the first few sentences or words from the document in question. However, these words often are not related to the topic of the document, but are advertisements, navigational elements ("Home", "People", "Search", and so on) or other generic texts. This makes them a suboptimal choice for summarizing the document.

To overcome this problem, the Hypertext Markup Language provides a tag with which an abstract can be supplied: the author of a document simply adds a <meta name="description" content="..."> element to the document, and a search engine can use that as a summary when presenting the document. Unfortunately, a substantial number of authors do not accurately summarize their documents.

Another approach, which requires storing the full text of every indexed document, is to show the sentences from the document in which at least one word from the query occurs. This allows the user to quickly determine whether the document is relevant to his query (instead of being an accidental match, for instance because the word is used in the document in a different sense than in the query).
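The sentence-extraction step can be sketched as follows; the naive sentence split on end-of-sentence punctuation is an assumption, as is the whitespace tokenization of the query.

```python
import re

def matching_sentences(text, query):
    """Return the sentences of text containing at least one query word."""
    terms = {w.lower() for w in query.split()}
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences
            if terms & {w.lower() for w in re.findall(r"\w+", s)}]

document = ("Welcome to our site. The crawler indexes every page nightly. "
            "Contact us for details.")
snippet = matching_sentences(document, "crawler")
# only the sentence actually mentioning the query term is kept,
# skipping the generic welcome text
```

This also illustrates why the full text must be stored: the snippet depends on the query, so it cannot be precomputed at indexing time.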