Projects

Research Topic: Unsupervised learning for information extraction from semi-structured pages

The amount of information on the Web continues to grow rapidly, and there is an urgent need for information extraction systems that can turn some of this online information from "human-readable only" into "machine-readable". Information extraction systems that extract data tuples from particular information sources are often called wrappers. Building wrappers by hand is problematic because the number of wrappers needed is huge and the format of many sources is frequently updated. One solution is wrapper induction: systems that learn wrappers from example Web pages.
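
To make the term concrete, here is a minimal sketch of a hand-written wrapper for one hypothetical source whose records are rows of an HTML table. It is an illustration only and is not taken from the project; the names TableWrapper, extract_tuples and PAGE, and the sample page itself, are assumptions made for this example.

```python
from html.parser import HTMLParser


class TableWrapper(HTMLParser):
    """Hand-written wrapper: turns each <tr> of a page into a tuple of cell texts."""

    def __init__(self):
        super().__init__()
        self.rows = []      # extracted tuples, one per table row
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only text inside a <td> belongs to a record.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(tuple(self._row))
            self._row = None
            self._cell = None


def extract_tuples(html):
    """Run the wrapper over one page and return the extracted data tuples."""
    parser = TableWrapper()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical sample page, invented for this sketch.
PAGE = """
<table>
  <tr><td>Learning Python</td><td>$35.00</td></tr>
  <tr><td>Data on the Web</td><td>$52.50</td></tr>
</table>
"""

print(extract_tuples(PAGE))
```

The wrapper is tied to one page layout: if the same records were published as a list instead of a table, it would extract nothing and would have to be rewritten by hand.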

Many wrapper induction systems have been constructed, especially for information extraction from HTML tables and lists. This research differs from most of them in that our system aims to learn from unlabeled tabular pages.

This project will be based on the work detailed in these two papers:

Xiaoying Gao, Peter Andreae, Richard Collins. "Approximately Repetitive Structure Detection for Wrapper Induction". (PS) Proceedings of the 8th Pacific Rim International Conference on Artificial Intelligence (PRICAI 2004), Auckland, New Zealand, 11-13 August 2004, pp. 585-594.
Xiaoying Gao, Mengjie Zhang, Peter Andreae. "Automatic Pattern Construction for Web Information Extraction". (PDF) International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, Vol. 12, No. 4, 2004.