If the goal of an information system is to retrieve all and only the relevant documents in a collection for a particular query, how can it work when the collection is every document available on the Web? We cannot evaluate our success by recall and precision alone: recall is the share of all relevant documents that were actually retrieved, and on the Web there is simply no way of knowing how many relevant documents are out there.
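To make the problem concrete, here is a minimal sketch of the two measures, assuming a small closed collection in which the full set of relevant documents is known. The function names and document IDs are hypothetical, chosen only for illustration. Note that `recall` needs the complete `relevant` set, which is exactly what we cannot obtain for the Web.

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved, relevant):
    """Fraction of all relevant documents that were retrieved.

    Requires knowing every relevant document in the collection.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


# Hypothetical toy collection: the system returned four documents,
# and five documents in the collection are actually relevant.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d1", "d3", "d5", "d6", "d7"]

p = precision(retrieved_docs, relevant_docs)  # 2 of 4 retrieved are relevant
r = recall(retrieved_docs, relevant_docs)     # 2 of 5 relevant were retrieved
```

Precision can still be estimated on the Web by judging the results a search returns, but the denominator of recall stays unknown.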
Baeza-Yates & Ribeiro-Neto (1999) identified the following problems, which also affect retrieving information from the Web:
- Distributed data: documents spread over millions of different web servers.
- Heterogeneous data: multiple media types (images, video), multiple languages, even different alphabets.
- Volatile data: documents can change or disappear rapidly.
- Unstructured and redundant data: no uniform structure, nearly 30% duplicate documents.
- Quality of data: no editorial control, inaccurate information, poor quality writing, typos.
- Large volume: billions of separate documents.
References
Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval. New York: Association for Computing Machinery (ACM) Press.
Shneiderman, B., & Kearsley, G. (1989). Hypertext hands-on: An introduction to a new way of organizing and accessing information. Boston: Addison-Wesley.
Singhal, A. (2001). Modern information retrieval: A brief overview. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 24(4), 35-43. Retrieved March 8, 2008, from http://singhal.info/ieee2001.pdf
1 comment:
In a related vein, it also has to be tagged and described! I'll pass on that task - I'm rooting for del.icio.us.