4.05.2008

The Web as an Information Retrieval System

In traditional database systems, the retrieval of information depends in large part on the organization of the database used for the search and on the description of the data it contains. This structured environment makes retrieval much easier. However, the Web was not designed as a system for the retrieval of organized information. Instead, it evolved into a dynamic, unstructured, and largely uncontrolled archive of the world’s digital documents (Singhal, 2001). Although some have described the Web as a hypertext database (Shneiderman & Kearsley, 1989), others disagree, arguing that a true hypertext database has a conceptual model behind it that provides organization and consistency. This is hardly true of the Web, or even of individual documents on the Web (Baeza-Yates & Ribeiro-Neto, 1999).

If the goal of an information system is to retrieve all and only the relevant documents in a collection for a particular query, how does it work when the collection is all documents available on the Web? We cannot evaluate our success based upon recall and precision alone, because recall depends on the total number of relevant documents in the collection, and on the Web there is no way of knowing how many relevant documents are out there.
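
To make the measurement problem concrete, here is a minimal sketch of how precision and recall are usually computed for a closed collection. The function name precision_recall and the document ids are made up for illustration; they do not come from any of the cited sources.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids of the documents the system returned
    relevant:  ids of ALL relevant documents in the collection
               (knowable for a small test collection, not for the Web)
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually returned

    # Precision: what fraction of the returned documents are relevant?
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: what fraction of all relevant documents were returned?
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical closed collection where every relevant document is known.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d2", "d5", "d6", "d7"}
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Precision only needs a judgment on the documents that came back, so it can at least be estimated for a Web search. Recall needs the complete relevant set, which is exactly the number nobody can supply for the Web.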

Baeza-Yates & Ribeiro-Neto (1999) identified the following problems that complicate retrieving information from the Web:
  • Distributed data: documents spread over millions of different web servers.
  • Heterogeneous data: multiple media types (images, video), multiple languages, even different alphabets.
  • Volatile data: documents can change or disappear rapidly.
  • Unstructured and redundant data: no uniform structure, nearly 30% duplicate documents.
  • Quality of data: no editorial control, inaccurate information, poor quality writing, typos.
  • Large volume: billions of separate documents.

Information professionals have always needed to evaluate any information they retrieved. Now, in spite of the remarkable progress made in developing sophisticated search tools on the Web, the need to assess the reliability of retrieved information is more important than ever.

References

Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval. New York: Association for Computing Machinery (ACM) Press.

Shneiderman, B., & Kearsley, G. (1989). Hypertext hands-on: An introduction to a new way of organizing and accessing information. Boston: Addison-Wesley.

Singhal, A. (2001). Modern information retrieval: A brief overview. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 24(4), 35-43. Retrieved March 8, 2008, from http://singhal.info/ieee2001.pdf

1 comment:

Ken said...

In a related vein, it also has to be tagged and described! I'll pass on that task - I'm rooting for del.icio.us.