Research Topic: Web page classification based on both text and image information

The objective of this project is to investigate whether image information helps in detecting page content. Most current page detection systems analyze only textual information and consequently fail to detect content embedded in images. We plan to use machine learning to build a trainable system that draws on both text and image information.

The page content detection problem is formalized as a page classification problem. A database of Web pages will be created, and the pages will be manually classified into two groups: target pages and non-target pages. We will randomly choose half of the pages for training and use the other half for testing. This project will investigate:

This project is collaborative research with Dr. Mengjie Zhang.
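
The random half-and-half split of the manually classified pages described above could be sketched as follows. This is only an illustration under stated assumptions, not the project's actual code: the function name split_half, the fixed seed, and the example page identifiers are all hypothetical.

```python
import random


def split_half(pages, seed=0):
    """Shuffle labeled pages and split them into training and test halves.

    `pages` is any sequence of (page_id, label) pairs. The seed is an
    assumption added here so that the split is reproducible.
    """
    rng = random.Random(seed)
    shuffled = list(pages)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


# Hypothetical manually classified pages: (page identifier, label).
pages = [
    (f"page_{i}.html", "target" if i % 2 == 0 else "non-target")
    for i in range(10)
]

train, test = split_half(pages, seed=42)
print(len(train), "training pages;", len(test), "test pages")
```

With an odd number of pages, the extra page falls into the test half. A plain random split like this does not guarantee the same proportion of target and non-target pages in each half; a stratified split would be one way to enforce that if it matters.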