Projects

Research Topic: Using Link Information for Web Page Clustering

The Internet contains a vast amount of information and is growing at an increasing rate. People often use search engines such as Google to find the information they want. A search engine typically returns a list of thousands, if not millions, of documents. To find the document they want, users must check the list manually, page by page. Since the required information is often buried deep in the result list, this is usually a very time-consuming process.

One possible solution to this problem is to use Web page clustering techniques to group the search results into categories of documents. Web page clustering is a way to automatically group similar documents together. By adding Web page clustering to search engines, search results can be presented as a small set of topics, each of which covers a group of similar documents. Users would then choose only the topic they want and check the much shorter list of documents in that topic. This speeds up the process of finding relevant documents.
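As a rough illustration of grouping search results by content, the sketch below compares results using bag-of-words cosine similarity and a greedy single-pass grouping. Both choices, along with the names `term_vector`, `group_results`, the `threshold` value, and the sample snippets, are assumptions made for this example; they are not the clustering method of this project.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts for one document (lowercased word tokens)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_results(results, threshold=0.3):
    """Greedy single-pass grouping (hypothetical baseline).

    Each result joins the first group whose first member is similar
    enough; otherwise it starts a new group. Returns lists of indices.
    """
    vectors = [term_vector(r) for r in results]
    groups = []
    for i, vec in enumerate(vectors):
        for group in groups:
            if cosine(vec, vectors[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            # No existing group was close enough.
            groups.append([i])
    return groups


# Made-up result snippets for an ambiguous query.
results = [
    "jaguar car dealer prices",
    "jaguar big cat habitat",
    "used jaguar car prices",
    "jaguar cat rainforest habitat",
]
groups = group_results(results)
for topic, members in enumerate(groups):
    print(topic, [results[i] for i in members])
```

With these snippets the car-related and animal-related results end up in separate groups, so a user interested in only one sense of the query would need to scan only that group.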

Many clustering methods have been developed; however, most of them are not suitable for clustering Web pages. An important property of Web pages is that they are connected by links, and the information in these links is valuable for determining the relevance of a document. Most current clustering methods ignore this property and consider only page content.

The overall goal of this project is to investigate whether link information is helpful for Web page clustering. In this project, two kinds of link information will be considered: in-linking pages (the pages that link to a particular document in the search result) and out-linked pages (the pages that are linked to from a particular document in the search result). The project will develop a new Web page clustering algorithm that uses information from the in-linking pages and the out-linked pages. Specifically, this project will investigate:

How to incorporate the in-link and out-link information into the clustering algorithm, and how to combine the link information effectively with the page content for clustering.
Whether this new Web page clustering algorithm can outperform clustering based on page content alone.
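
One simple way to picture the first question is a weighted sum of three similarity scores: one for page content, one for shared in-linking pages, and one for shared out-linked pages. The sketch below is only an illustration under stated assumptions. Cosine similarity for content, Jaccard overlap for link sets, the weights `w_content`, `w_in`, `w_out`, the page dictionary layout, and the example URLs are all hypothetical; how best to combine these signals is exactly what the project sets out to investigate.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard(a, b):
    """Overlap of two sets of URLs; 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def combined_similarity(p, q, w_content=0.5, w_in=0.25, w_out=0.25):
    """Weighted sum of content, in-link, and out-link similarity.

    The weights are arbitrary placeholders, not tuned values.
    """
    content = cosine(Counter(p["terms"]), Counter(q["terms"]))
    in_sim = jaccard(set(p["in_links"]), set(q["in_links"]))
    out_sim = jaccard(set(p["out_links"]), set(q["out_links"]))
    return w_content * content + w_in * in_sim + w_out * out_sim


# Two made-up pages that share little text but have similar link neighbours.
page_a = {
    "terms": ["jaguar", "car", "prices"],
    "in_links": {"autos.example/reviews"},
    "out_links": {"dealer.example", "maker.example"},
}
page_b = {
    "terms": ["jaguar", "speed", "test"],
    "in_links": {"autos.example/reviews"},
    "out_links": {"maker.example"},
}

content_only = combined_similarity(page_a, page_b, 1.0, 0.0, 0.0)
with_links = combined_similarity(page_a, page_b)
print(f"content only: {content_only:.3f}, with links: {with_links:.3f}")
```

Setting `w_in` and `w_out` to zero reduces the score to content-only similarity, which gives a natural baseline for the second question.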