Research Topic: Site-level Information Extraction

Most information extraction systems are based on the assumption that the information to be extracted is presented on a single page and that each page can be handled in isolation. However, this assumption does not hold for many Web documents: the documents are hypertext, and the information to be extracted is spread across multiple pages, typically pages on one site. This research focuses on site-level IE, which aims to extract related information from multiple Web pages on one site. One example would be to start from my school's home page and extract my research details from my home page, my contact details from the ``contact'' page, my research group from the ``research'' page, and my teaching information from the ``course'' page.
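As a concrete illustration of this setting, the sketch below assembles one site-level record from partial extractions on several pages. The page names, fields, and values are hypothetical, chosen to mirror the home-page example:

```python
# Each page on the site contributes only part of the target record;
# no single page contains all the information to be extracted.
site_pages = {
    "index.html":    {"research": "wrapper induction for Web IE"},
    "contact.html":  {"email": "author@example.edu"},
    "research.html": {"group": "Information Agents Lab"},
    "course.html":   {"teaching": "CS 101: Intro to AI"},
}

def assemble_record(pages):
    """Merge the partial extractions from every page into one site-level record."""
    record = {}
    for fields in pages.values():
        record.update(fields)
    return record

record = assemble_record(site_pages)
print(record["email"])  # this field is found only on the contact page
```

The point of the sketch is only that the unit of extraction is the site, not the page: the record is complete only after several pages have been visited.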

Site-level information extraction is important because much of the information on the Web is organized at the site level, with related facts spread across several pages, rather than presented on a single page.

Site-level information extraction is a difficult task, given the dynamic nature of Web sites. Every site has a different domain and a different format. Even within a single site, the format, such as the link structure, is often updated, and sometimes the site content is updated as well. Research on information agents has achieved some success in finding specific information on the Web, but most information agents are tailored to specific domains and specific tasks.

This research builds on our previous work on information agents and wrapper induction for page-level information extraction. We investigate the site-level information extraction problem across a large number of domains and a number of different tasks. We also examine and compare a number of different wrapper representation methods.

The goal of this research is to develop a language that can describe the target information and how it can be found on sites, to investigate which heuristics are useful for choosing which link to follow, and to determine what kind of search strategy should be used when following links.
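One plausible instantiation of these ideas is a best-first search over a site's links, scored by a simple anchor-text heuristic. The keywords, weights, and toy site graph below are illustrative assumptions, not the proposed description language itself:

```python
import heapq

# Hypothetical anchor-text heuristic: links whose anchor text mentions a
# target field score higher and are followed first.
KEYWORDS = {"contact": 1.0, "research": 0.8, "course": 0.6}

def score(anchor_text):
    """Score a link by keyword overlap between its anchor text and the target fields."""
    text = anchor_text.lower()
    return max((w for k, w in KEYWORDS.items() if k in text), default=0.0)

def best_first_crawl(start, links, limit=10):
    """Best-first search over the site graph; `links` maps page -> [(anchor, url)]."""
    frontier = [(-1.0, start)]          # max-heap via negated scores
    visited, order = set(), []
    while frontier and len(order) < limit:
        _, page = heapq.heappop(frontier)
        if page in visited:
            continue
        visited.add(page)
        order.append(page)
        for anchor, url in links.get(page, []):
            if url not in visited:
                heapq.heappush(frontier, (-score(anchor), url))
    return order

# Toy site graph with anchor texts; irrelevant links sink to the end.
links = {
    "index.html": [("Contact me", "contact.html"),
                   ("My research group", "research.html"),
                   ("Courses I teach", "course.html"),
                   ("Photo gallery", "photos.html")],
}
print(best_first_crawl("index.html", links))
```

Under this heuristic the crawler visits the contact, research, and course pages before the irrelevant photo gallery; richer heuristics (URL patterns, surrounding text) could be scored the same way.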

Rather than learning site-specific wrappers, our focus is to build wrappers that work on most Web sites in the same domain. We believe that a domain-specific wrapper can be generated from a limited amount of domain knowledge, and that such a knowledge-based wrapper can be used on a wide range of Web sites in the same domain. Ours is a knowledge-based approach: we investigate what knowledge is useful for guiding information extraction, how to represent that knowledge, and how much of it can be learned.
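As a rough sketch of a knowledge-based, domain-specific wrapper, the field patterns below stand in for the domain knowledge; they are illustrative assumptions, and the same wrapper applies unchanged to differently formatted pages from different sites in the domain:

```python
import re

# The "domain knowledge" here is a set of field patterns assumed to hold
# across most sites in the domain, rather than a wrapper induced for the
# layout of one specific site. The patterns are illustrative only.
DOMAIN_KNOWLEDGE = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}

def extract(page_text, knowledge=DOMAIN_KNOWLEDGE):
    """Apply the domain patterns to any page in the domain."""
    return {field: m.group(0)
            for field, pat in knowledge.items()
            if (m := pat.search(page_text))}

# The same knowledge-based wrapper handles two sites with different formats.
print(extract("Email: jane@cs.example.edu, Tel: +1 555 0100"))
print(extract("Reach me at bob@uni.example.org"))
```

A site-specific wrapper would break when either page changed its layout; the knowledge-based patterns depend only on the domain, which is the property the approach aims for.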