Research Topic: Unsupervised learning for information extraction from semi-structured pages

The amount of information on the Web continues to grow rapidly, and there is an urgent need for information extraction systems that can turn some of this online information from ``human-readable only'' into ``machine-readable'' form. Information extraction systems that extract data tuples from a particular information source are often called wrappers. Building wrappers by hand is problematic because the number of wrappers needed is huge and many sources change their format frequently. One solution is wrapper induction: systems that learn wrappers from example Web pages.

Many wrapper induction systems have been built, especially for information extraction from HTML tables and lists. This research differs from most of them in that our system aims to learn from unlabeled tabular pages, rather than from examples that have been labeled by hand.

This project will be based on the work detailed in these two papers: