Research Topic: Machine Learning for Automatic Metadata Extraction from Web Pages

An important task in building a digital library is to extract and record the metadata of Web pages, such as title, date, identifier, subject, description, and language. This project addresses two problems. If a Web site provides metadata, for example embedded in META tags, our system should be able to transform that metadata into the required format. If a Web site does not provide any metadata, our system should be able to extract information from the resource itself and fill in most of the slots in the metadata template.

Currently only a small number of Web sites provide metadata, and the metadata can be presented in many different formats. For example, metadata may be embedded in Web pages using META tags, held in a separate HTML document linked to the resource, or stored in a database linked to the resource. The metadata may use different schemas, such as Dublin Core, AACR2, GILS, EAD, IMS, and AGLS, and may be presented in different forms, such as HTML, SGML, XML, RDF, MARC, and MIME. Converting metadata from these different forms and schemas into the schema designed by the National Library is not trivial.
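As an illustration of the simplest conversion case, the sketch below reads META tags embedded in an HTML page and maps them onto target slots. It uses only the Python standard library. The slot names and the META_TO_SLOT mapping are assumptions made for this example; the actual National Library schema is not specified here, and a real converter would also need to handle the other forms listed above (XML, RDF, MARC, and so on).

```python
from html.parser import HTMLParser

# Hypothetical mapping from embedded META names to target-schema slots.
# The slot names on the right are assumptions, not the National Library schema.
META_TO_SLOT = {
    "dc.title": "title",
    "dc.date": "date",
    "dc.identifier": "identifier",
    "dc.subject": "subject",
    "keywords": "subject",
    "dc.description": "description",
    "description": "description",
    "dc.language": "language",
}


class MetaTagParser(HTMLParser):
    """Collects META name/content pairs that map to a known slot."""

    def __init__(self):
        super().__init__()
        self.record = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = attr.get("content")
        slot = META_TO_SLOT.get(name)
        if slot and content:
            # Keep the first value seen for each slot.
            self.record.setdefault(slot, content.strip())


def convert_embedded_metadata(html):
    """Return a dict of target slots filled from embedded META tags."""
    parser = MetaTagParser()
    parser.feed(html)
    return parser.record
```

A page with no recognised META tags yields an empty dictionary, which is the signal to fall back on extraction from the page content.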

Most Web sites do not provide metadata, so there is an urgent need to build a computer system that automatically extracts metadata from Web resources. Some of the metadata can be extracted directly from the Web pages, such as title, date, identifier, subject, and description, and some must be inferred from the pages, such as language and format. Our system should be able to automatically extract the relevant information, fill in most of the slots of the schema template, and allow the results to be edited. This information extraction is very hard because of the great diversity of page contents and formats.
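To make the slot-filling idea concrete, here is a minimal hand-written baseline, assuming a page with no embedded metadata. It takes the title from the TITLE element (falling back to the first H1), uses the URL as the identifier, takes the first paragraph as a description, reads the language from the lang attribute of the HTML element, and looks for an ISO-style date. All of these heuristics, and the slot names, are assumptions chosen for illustration; they are exactly the kind of brittle rules that the diversity of real pages defeats.

```python
import re
from html.parser import HTMLParser


class PageTextParser(HTMLParser):
    """Collects the text of <title>, the first <h1>, and the first <p>."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "h1": "", "p": ""}
        self.lang = None
        self._open = None   # tag currently being captured
        self._done = set()  # tags already captured once

    def handle_starttag(self, tag, attrs):
        if tag == "html":
            self.lang = dict(attrs).get("lang")
        elif tag in self.fields and tag not in self._done and self._open is None:
            self._open = tag

    def handle_endtag(self, tag):
        if tag == self._open:
            self._done.add(tag)
            self._open = None

    def handle_data(self, data):
        if self._open:
            self.fields[self._open] += data


# Matches ISO dates such as 2004-03-15 only; other date styles are ignored.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_metadata(html, url):
    """Fill a hypothetical metadata template using simple heuristics."""
    parser = PageTextParser()
    parser.feed(html)
    # Collapse runs of whitespace in each captured field.
    f = {k: " ".join(v.split()) for k, v in parser.fields.items()}
    match = DATE_RE.search(html)
    return {
        "title": f["title"] or f["h1"] or None,
        "identifier": url,
        "description": f["p"][:200] or None,
        "language": parser.lang,
        "date": match.group(0) if match else None,
    }
```

Slots that cannot be filled are left as None so that a human editor can complete them.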

Both converting existing metadata and extracting metadata from resources involve information extraction from large numbers of documents with great diversity in content and format. One solution is to use machine learning to automatically learn the patterns for metadata extraction. Patterns can be learned from training examples, such as Web pages in which the metadata to be extracted has been manually labelled. A trainable system can be created in this way, and the learned patterns can be used for future metadata extraction.
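One possible shape for such a trainable system is sketched below. It is a small Naive Bayes classifier with add-one smoothing that labels a page fragment (an HTML tag name plus its text) as "title", "date", or "other", trained on manually labelled fragments. The choice of Naive Bayes, the three features, and the label set are all assumptions made for this example; the project does not commit to a particular learning algorithm.

```python
import math
from collections import Counter, defaultdict


def features(tag, text):
    """Describe a page fragment by three simple, hand-chosen features."""
    words = text.split()
    return [
        "tag=" + tag,
        "short" if len(words) <= 8 else "long",
        "has_digit" if any(c.isdigit() for c in text) else "no_digit",
    ]


class NaiveBayes:
    """Multinomial Naive Bayes over fragment features, add-one smoothed."""

    def __init__(self):
        self.label_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, examples):
        # Each example is ((tag, text), label), labelled by hand.
        for (tag, text), label in examples:
            self.label_counts[label] += 1
            for f in features(tag, text):
                self.feature_counts[label][f] += 1
                self.vocab.add(f)

    def predict(self, tag, text):
        total = sum(self.label_counts.values())
        best, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            score = math.log(n / total)
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            for f in features(tag, text):
                score += math.log((self.feature_counts[label][f] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best


# Tiny hand-labelled training set, invented for illustration only.
EXAMPLES = [
    (("h1", "Annual Report of the Library"), "title"),
    (("title", "Digital Collections Home"), "title"),
    (("span", "Published 12 May 2003"), "date"),
    (("p", "Last updated 2004-03-15"), "date"),
    (("p", "The library holds a large collection of manuscripts, maps, "
           "photographs and newspapers from many regions."), "other"),
    (("div", "Contact the reference desk for help with finding items "
             "in any of our reading rooms today."), "other"),
]

model = NaiveBayes()
model.train(EXAMPLES)
```

With this toy training set, model.predict("span", "Modified 2005-01-20") is labelled as a date and model.predict("h2", "Maps of the Pacific") as a title, even though the tag h2 never appears in the training data. A realistic system would need far more labelled pages and richer features.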

This project will start with an investigation of the tools and systems available on the Internet, and will then design a system that extracts metadata and stores it using the metadata schema designed by the NZ National Library.

I am happy to supervise one honours student to carry out this project. I have supervised two honours/MCompSci students to complete their projects in the area of machine learning for information extraction and I have over 10 publications in the area of Web intelligence.